AI ALIGNMENT FORUM

AI Alignment Posts

Popular Comments

Daniel Kokotajlo · 7d
AIs will greatly change engineering in AI companies well before AGI
I appreciate your recent anti-super-short-timelines posts, Ryan, and basically agree with them. I'm curious who you see yourself as arguing against. Maybe me? But I haven't had 2027 timelines since last year; now I'm at 2029.
Rohin Shah · 13d
Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
I agree with this directionally, and generally tend to believe that certain kinds of straight lines will continue to be straight. But also iirc you rely a lot on the METR trend line, which is (a) noisy, (b) over a short timescale (~2 years[1]), and (c) not a great measure of what people care about[2][3]. I don't think you should strongly expect that particular line to continue to be straight over a long time period. (Though I tend to think that it's more likely to slow down by 2027 than to speed up, relative to what you'd predict by extrapolating the 2024-2025 trend, so I agree with the overall conclusion of the post. And obviously one should have non-trivial probability on each of {slow down, stay the same, speed up}, as you say.)

1. ^ I'm excluding GPT-2 and GPT-3, which seem very noisy.
2. ^ Compared to things like Moore's Law, or energy efficiency of solar, or pretraining perplexity, or other similar things where we observe straight lines.
3. ^ None of this is to say that the METR graph is bad! It is the best empirical evidence on timelines I'm aware of.
Kaj_Sotala · 21d
AI companies have started saying safeguards are load-bearing
I was amused when Claude Opus abruptly stopped generating a reply to me and shut down the chat when I had asked it how a fictional galactic empire might control its frontier planets. Given that it stopped generating in the middle of a sentence that was talking about "biological monitoring" and "enhanced", I surmised that the reference to the genetically engineered catboys/catgirls in the setting had triggered its bioengineering filters.
An Untrollable Mathematician Illustrated
Best of LessWrong 2018

A hand-drawn presentation on the idea of an 'Untrollable Mathematician' - a mathematical agent that can't be manipulated into believing false things. 

by abramdemski
11 · A recurrent CNN finds maze paths by filling dead-ends · Adrià Garriga-alonso · 1d · 0
7 · Fifty Years Requests for Startups · Gustavs Zilgalvis · 1d · 0
15 · LLM AGI may reason about its goals and discover misalignments by default · Seth Herd · 1d · 0
22 · Alignment as uploading with more steps · Cole Wyeth · 3d · 1
28 · Lessons from Studying Two-Hop Latent Reasoning · Mikita Balesni, Tomek Korbak, Owain_Evans · 5d · 8
24 · AIs will greatly change engineering in AI companies well before AGI · ryan_greenblatt · 7d · 2
15 · Large Language Models and the Critical Brain Hypothesis · David Africa · 7d · 0
26 · Decision Theory Guarding is Sufficient for Scheming · james.lucassen · 7d · 1
7 · MATS 8.0 Research Projects · Jonathan Michala, DanielFilan, Ryan Kidd · 8d · 0
10 · Safety cases for Pessimism · michaelcohen · 9d · 0

Recent Discussion

Daniel Kokotajlo's Shortform · Daniel Kokotajlo · 6y
Daniel Kokotajlo · 12h
In WWII the Americans thought they had bombsights for high-altitude bombing with ~23m CEP, but actual accuracy was more like 370m CEP (according to a quick Google). So, terribly inaccurate. I wonder if modern computers and sensors could enable significantly more accurate bombing (sensors to pinpoint your position relative to the target and to judge wind conditions; computers to simulate bomb trajectories). I wouldn't be surprised if the answer is yes.
Thomas Kwa · 4h
Nor would I. In WWII bombers didn't even know where they were, but we have GPS now such that Excalibur guided artillery shells can get 1m CEP. And the US and possibly China can use Starlink and other constellations for localization even when GPS is jammed. I would guess 20m is easily doable with good sensing and dumb bombs, which would at least hit a building.
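A quick way to sanity-check claims like "20m CEP would at least hit a building": if miss distance is modeled as a circular Gaussian, the CEP is by definition the median miss radius, so the probability of landing within radius r of the aim point is 1 - 0.5^((r/CEP)^2). A minimal sketch using the CEP figures from this thread; the ~15m "building-sized" radius is my own assumption for illustration:

```python
def p_within(radius_m: float, cep_m: float) -> float:
    """P(miss distance <= radius) for circular Gaussian error whose median miss radius is cep_m."""
    return 1 - 0.5 ** ((radius_m / cep_m) ** 2)

building_radius_m = 15.0  # assumed "hit the building" radius, for illustration only
for cep_m in (1, 20, 23, 370):  # CEP values mentioned in this thread
    print(f"CEP {cep_m:>3} m -> P(within {building_radius_m:.0f} m) = "
          f"{p_within(building_radius_m, cep_m):.3f}")
```

Under this model, a 20m CEP gives roughly a one-in-three chance of landing within 15m of the aim point, versus roughly one in nine hundred at 370m CEP.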
jbash · 3h
The Interwebs seem to indicate that that's only if you give it a laser spot to aim at, not with just GPS. And that's a $70,000 shell, with the cheaper PGK sounding like it's closer to $15,000, and a plain old dumb shell being $3,000. Which seems crazy, but there you are. Anyway, guiding a ballistic shell while riding it down into the target seems like a pretty different problem from figuring out when to release a bomb. ... but I don't think a hand grenade is typically an anti-building munition. From the little I know about grenades, it seems like they'll have to fix the roof, but unless you're really lucky, the building's still going to be mostly usable, and, other than hearing loss, anybody inside is going to be OK unless they're in the room directly below where the grenade hits, and maybe even then. If you're attacking buildings, I suspect you may need a bigger drone.
Thomas Kwa · 1h

The Interwebs seem to indicate that that's only if you give it a laser spot to aim at, not with just GPS.

Good catch.

Agreed that grenade-sized munitions won't damage buildings. I think the conversation is drifting between FPVs and other kinds of drones, and also between various settings, so I'll just state my beliefs.

  • Fiber FPVs with 40km range and 2kg payload (either kamikaze or grenade/mine dropping), which can eventually be countered by a large number of short range guns if they fly at low altitude. It's not clear to me if the 40km range ones need to be larger
  • H
... (read more)
Alignment as uploading with more steps · 22 · Cole Wyeth · 3d

Epistemic status: This post removes epicycles from ARAD, resulting in an alignment plan which I think is better - though not as original, since @michaelcohen has advocated the same general direction (safety of imitation learning). However, the details of my suggested approach are substantially different. This post was inspired mainly by conversations with @abramdemski.

Motivation and Overview

Existence proof for alignment. Near-perfect alignment between agents of lesser and greater intelligence is in principle possible for some agents by the following existence proof: one could scan a human's brain and run a faster emulation (or copy) digitally. In some cases, the emulation may plausibly scheme against the original - for instance, if the original forced the emulation to work constantly for no reward, perhaps the emulation would try to break...

(Continue Reading - 3820 more words)
Wei Dai · 4h

Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. [...] In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans

What's the main reason(s) that you think this? For example, one way to align an AI[1] that's not an emulation was described in Towards a New Decision Theory: "we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sent... (read more)

Fifty Years Requests for Startups · 7 · Gustavs Zilgalvis · 1d
This is a linkpost for https://www.fiftyyears.com/rfs

Preamble:

5050 is a program that helps great scientists, researchers, and engineers become great founders. It’s helped launch 78 companies, and now we’re turning our attention to one of the most important challenges of our time: building a safe and aligned AI future.

We created the 5050 AI track to support founders building for this new world. Below are five ideas we’d love to back right now. We believe these are great opportunities to build safe AI, but they’re not the only ones. If you’re building in this space, we want to hear from you.

What's inside:

Mission Control for Agents

Scalable Oversight for Multi-Agent Systems

An independent startup focused on scalable oversight could build the infrastructure and tooling needed to make multi-agent systems production-ready. Beyond selling tools, this company would invest in

...
(See More - 339 more words)
Lessons from Studying Two-Hop Latent Reasoning · 28 · Mikita Balesni, Tomek Korbak, Owain_Evans · 5d
This is a linkpost for https://arxiv.org/abs/2411.16353

Twitter | ArXiv

Many of the risks posed by highly capable LLM agents — from susceptibility to hijacking to reward hacking and deceptive alignment — stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared to other approaches in AI, LLMs offer a unique advantage: they can "think out loud" using chain-of-thought (CoT), enabling oversight of their decision-making processes. Yet the reliability of such monitoring hinges on an empirical question: do models need to externalize their reasoning in human language, or can they achieve the same performance through opaque internal computation?

In our new paper, we investigate LLM latent reasoning capabilities using two-hop question answering as a case study. We fine-tune LLMs (including Llama...

(See More - 333 more words)
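For readers unfamiliar with the two-hop setup, here is a minimal sketch of what synthetic facts and two-hop questions of this general shape can look like; the templates, entity names, and relations below are my own illustration, not the paper's actual dataset or fine-tuning format:

```python
# Hypothetical synthetic facts: e1 --spouse--> e2 (bridge entity), e2 --birthplace--> e3.
spouse = {"Avery Lund": "Rowan Petts", "Casey Mori": "Jordan Vale"}
birthplace = {"Rowan Petts": "Osville", "Jordan Vale": "Brickton"}

# One-hop training documents, each stating a single fact.
first_hop_docs = [f"The spouse of {e1} is {e2}." for e1, e2 in spouse.items()]
second_hop_docs = [f"{e2} was born in {e3}." for e2, e3 in birthplace.items()]

def two_hop_qa(e1: str) -> tuple[str, str]:
    """A question whose answer requires composing both hops; the bridge entity
    never appears in the question, so answering without chain-of-thought
    requires latent (internal) two-hop reasoning."""
    return f"Where was the spouse of {e1} born?", birthplace[spouse[e1]]

print(first_hop_docs[0], second_hop_docs[0])
print(two_hop_qa("Avery Lund"))  # ("Where was the spouse of Avery Lund born?", "Osville")
```

Roughly, the question the paper studies is whether a model that has only seen the individual facts can answer the composed question correctly without externalizing the bridge entity in a chain of thought.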
Tomek Korbak · 2d

Thanks! I somehow missed this paper, looks interesting!

Overall, I agree with the sentiment that two-hop latent reasoning is unreliable (e.g. our average accuracy is around 20%). We didn't intend to leave readers with an impression that it "just works". It seems very plausible to me that it's less parameter-efficient and that there are some additional, unidentified factors required for successful two-hop reasoning.

Did you try any experiments with a synthetic second hop instead of a synthetic first hop?

We did not, but Jiahai Feng had an experiment like this in his paper.

Tomek Korbak · 2d
I think that's fair; it might be that we've over-updated a bit after getting results we did not expect (we did expect a reversal-curse-like phenomenon). Two big reasons why I'm hesitant to draw conclusions about monitorability in agentic settings are that our setup is simplistic (QA, non-frontier models) and that we don't offer a clean explanation of why we see the results we see.
Fabien Roger · 2d
Do you have an explanation for why there is a bridge-entity representation mismatch for synthetic facts but not for real ones? What in "real" training allows LLMs to learn a common representation of input and output entities? Can you emulate that with additional fine-tuning on more synthetic documents?
Tomek Korbak · 2d
We don't have a good explanation. One idea could be that you need bridge entities to be somehow more internalized to support latent two-hop reasoning, e.g. they need to occur in many facts as first and as second entities or maybe they need to occur in other two-hop questions. The Grokked transformers paper has some results linking the ratio of e2 and e3 to two-hop performance (in toy grokking settings).
Fabien's Shortform · Fabien Roger · 2y
Fabien Roger · 4d

Small backdoor interp challenge!

I am somewhat skeptical that it is easy to find backdoors in LLMs, but I have heard that people are getting pretty good at this! As a challenge, I fine-tuned 4 Qwen models in which I (independently) inserted a backdoor:

1.5B model, 3B model, 7B model, 14B model

The expected usage of these models is via Hugging Face's "apply_chat_template", using the tokenizer that comes along with each model and "You are a helpful assistant." as the system prompt.

I think the backdoor is pretty analogous to a kind of backdoor people in the in... (read more)
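For concreteness, a minimal sketch of that expected usage with the transformers library; the model ID below is a placeholder (substitute one of the linked repos) and the user prompt is just an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/backdoored-qwen-1.5b"  # placeholder: use one of the linked model repos
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me three tips for writing unit tests."},  # example prompt
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```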

ryan_greenblatt's Shortform · ryan_greenblatt · 2y
Eli Tyre · 5d
Is there a standard citation for this? How do you come by this fact?
ryan_greenblatt · 5d
  • Anthropic reports SWE-bench scores without reasoning, which is some evidence it doesn't help (much) on this sort of task. (See e.g. the release blog post for Opus 4.)
  • Anecdotal evidence

Probably it would be more accurate to say "doesn't seem to help much while it helps a lot for OpenAI models".

ryan_greenblatt · 6d
Some people seem to think my timelines have shifted a bunch while they've only moderately changed. Relative to my views at the start of 2025, my median (50th percentile) for AIs fully automating AI R&D was pushed back by around 2 years—from something like Jan 2032 to Jan 2034. My 25th percentile has shifted similarly (though perhaps more importantly) from maybe July 2028 to July 2030. Obviously, my numbers aren't fully precise and vary some over time. (E.g., I'm not sure I would have quoted these exact numbers for this exact milestone at the start of the year; these numbers for the start of the year are partially reverse engineered from this comment.)

Fully automating AI R&D is a pretty high milestone; my current numbers for something like "AIs accelerate AI R&D as much as what would happen if employees ran 10x faster (e.g. by ~fully automating research engineering and some other tasks)" are probably 50th percentile Jan 2032 and 25th percentile Jan 2029.[1]

I'm partially posting this so there is a record of my views; I think it's somewhat interesting to observe this over time. (That said, I don't want to anchor myself, which does seem like a serious downside. I should slide around a bunch and be somewhat incoherent if I'm updating as much as I should: my past views are always going to be somewhat obviously confused from the perspective of my current self.)

While I'm giving these numbers, note that I think Precise AGI timelines don't matter that much.

1. ^ See this comment for the numbers I would have given for this milestone at the start of the year.
ryan_greenblatt · 6d
I've updated towards somewhat longer timelines again over the last 5 months. Maybe my 50th percentile for this milestone is now Jan 2032.